General introduction to data visualization
Introduction to ggplot2
Grammar of graphics
A case study using NBA data
Useful packages and extensions for ggplot2
10/10/2020
Exploratory data analysis
Explore pattern, trend, and distribution of one variable
Explore association between variables
Statistical analysis
Report your results and communicate with non-statisticians
A clearer way of presenting findings
Attract your audience
For fun…
## # A tibble: 12 x 6
##    dataset    meanx meany   sdx   sdy  corxy
##    <chr>      <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1 bullseye    54.3  47.8  16.8  26.9 -0.069
##  2 circle      54.3  47.8  16.8  26.9 -0.068
##  3 dino        54.3  47.8  16.8  26.9 -0.064
##  4 dots        54.3  47.8  16.8  26.9 -0.06
##  5 h_lines     54.3  47.8  16.8  26.9 -0.062
##  6 high_lines  54.3  47.8  16.8  26.9 -0.069
##  7 slant_down  54.3  47.8  16.8  26.9 -0.069
##  8 slant_up    54.3  47.8  16.8  26.9 -0.069
##  9 star        54.3  47.8  16.8  26.9 -0.063
## 10 v_lines     54.3  47.8  16.8  26.9 -0.069
## 11 wide_lines  54.3  47.8  16.8  26.9 -0.067
## 12 x_shape     54.3  47.8  16.8  26.9 -0.066
A statistical plot contains much more information than a table of summary statistics!
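The same point can be checked with a data set that ships with base R. The sketch below uses Anscombe's quartet (the built-in `anscombe` data frame) rather than the data behind the table above, so it is an analogous illustration and not a reproduction of that table: four x/y pairs share nearly identical means, standard deviations, and correlations, yet look quite different once plotted.

```r
# Anscombe's quartet: columns x1..x4 and y1..y4 in one data frame
data(anscombe)

# Summary statistics for each of the four x/y pairs
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(dataset = i,
             meanx = mean(x), meany = mean(y),
             sdx = sd(x), sdy = sd(y),
             corxy = cor(x, y))
}))
print(round(quartet_summary, 2))

# The scatter plots tell the four data sets apart immediately
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Data set", i))
}
par(op)
```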
One variable: Histogram, Bar chart, Density plot…
Two variables: Scatter plot, Box plot, Violin Plot…
Multiple variables: Heatmap…
Checking normality: qqplot…
Think of your data and variables carefully, and choose the most appropriate statistical plot.
##             TEAM    SEASON  WIN.   PTS OFFRTG DEFRTG   PACE REGION ABV
## 1  Atlanta Hawks 2015-2016 0.585 102.8  104.6  100.8  97.63   East ATL
## 2  Atlanta Hawks 2018-2019 0.354 113.3  107.5  113.1 104.56   East ATL
## 3  Atlanta Hawks 2017-2018 0.293 103.4  104.4  110.1  98.76   East ATL
## 4  Atlanta Hawks 2016-2017 0.524 103.2  104.5  105.2  97.76   East ATL
## 5 Boston Celtics 2015-2016 0.585 105.7  105.8  102.5  99.43   East BOS
## 6 Boston Celtics 2016-2017 0.646 108.0  110.6  108.0  97.21   East BOS
WIN.: Winning rate, which is the percentage of games played that a team has won.
PTS: The number of points scored.
OFFRTG: Offensive Rating, which measures a team’s points scored per 100 possessions.
DEFRTG: Defensive Rating, which is the number of points allowed per 100 possessions by a team.
PACE: Pace, which is the number of possessions per 48 minutes for a team.
REGION: East/West.
ABV: The abbreviation of a team.
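How the rate variables relate to raw counts can be sketched in a few lines. The season totals below are hypothetical inputs chosen for illustration and are not taken from the data set; the formulas simply follow the definitions given above.

```r
# Hypothetical season totals (illustrative only, not from nba.data)
wins        <- 48
games       <- 82
points      <- 8400
possessions <- 8000

# WIN.: share of games played that the team won
win_rate <- wins / games

# OFFRTG: points scored per 100 possessions
off_rtg <- 100 * points / possessions

round(win_rate, 3)  # 0.585
off_rtg             # 105
```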
A plot can:
be hard to read if labels and legends are not clear
confuse people if it is not well designed
deliver misleading information (sometimes on purpose)
The histograms of winning rate in different regular NBA seasons and regions generated by ggplot2 and graphics packages:
Code in ggplot2:
ggplot(data = sub.dt, aes(x = WIN.)) +
  geom_histogram(binwidth = 0.1, color = "black") +
  facet_grid(REGION ~ SEASON)
Code in graphics package:
par(mfrow = c(2, 2), mar = c(2, 2, 3, 1))
# One panel per REGION x SEASON combination
for (i in levels(sub.dt$REGION)) {
  for (j in levels(sub.dt$SEASON)) {
    subdata <- subset(sub.dt, REGION == i & SEASON == j)
    # Plot the subset, not the full data set
    hist(subdata$WIN., breaks = seq(0, 1, 0.1),
         main = paste(i, j, sep = ", "))
  }
}
Idea: a graph is a combination of independent building blocks.
Data that you want to visualize, and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes.
Layers made up of geometric elements and statistical transformations. Geometric objects, geoms for short, are points, lines, polygons, etc. Statistical transformations, stats for short, summarize data in many useful ways.
Scales map values in the data space to values in an aesthetic space, whether it be color, size, or shape.
A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic.
A facet describes how to break up the data into subsets and how to display those subsets as small multiples.
A theme controls the finer points of display, like the font size and background color.
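A minimal sketch of how these building blocks combine in code. It assumes the ggplot2 package is installed and uses the built-in mtcars data set as a stand-in, since nba.data is not distributed with these slides; the object name p_blocks is only illustrative.

```r
library(ggplot2)

p_blocks <- ggplot(data = mtcars,                  # data
                   aes(x = wt, y = mpg,            # aesthetic mappings
                       color = factor(cyl))) +
  geom_point() +                                   # layer: geom with identity stat
  geom_smooth(formula = y ~ x, method = "lm",      # layer: geom with a smoothing stat
              se = FALSE) +
  scale_color_discrete(name = "Cylinders") +       # scale
  coord_cartesian(xlim = c(1, 6)) +                # coordinate system
  facet_wrap(~am) +                                # facet
  theme_bw()                                       # theme
p_blocks
```

Each component is added with + and can be swapped out independently, which is the main practical consequence of the grammar.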
ggplot() is always the first line of your code.
We can specify the data set and the aesthetic mappings in ggplot().
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.))
p
Map the variables in the data to the components in the plot
x: x axis
y: y axis
color: color of the boundary of a symbol
fill: color of the inside of a symbol
shape: shape of points, solid point, circle, triangle…
size: size of points
linetype: type of lines, solid line, dashed line…
…
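A short sketch of the difference between mapping an aesthetic to a variable inside aes() and setting it to a constant outside aes(). It assumes ggplot2 is installed and uses mtcars as a stand-in data set; the object names are only illustrative.

```r
library(ggplot2)

# Mapped: color and size vary with the data, and legends are drawn
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl), size = hp))

# Set: every point gets the same fixed color, shape, and size
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue3", shape = 1, size = 3)

p_mapped
p_fixed
```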
Geometries are the actual graphical elements displayed in a plot. They visualize the variables mapped in aes().
We use + to connect multiple geometry functions.
p + geom_point()
data and aes can also be specified in a geom function. They don't have to be the same as those in ggplot().
ggplot() + geom_point(data = nba.data, aes(x = DEFRTG, y = WIN.))
geom functions for one variable:
p <- ggplot(data = nba.data, aes(x = WIN.))
p + geom_histogram(binwidth = 0.1)
p + geom_density()
geom functions for two continuous variables:
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.))
p + geom_point()
p + geom_line()
p + geom_density_2d()
p + geom_smooth(formula = y ~ x, method = "lm")
geom functions for one discrete and one continuous variable:
p <- ggplot(data = nba.data, aes(x = SEASON, y = WIN.))
p + geom_boxplot()
p + geom_violin()
Combining geom layers:
# after_stat(density) replaces the older ..density.. notation (ggplot2 >= 3.3.0)
ggplot(data = nba.data, aes(x = WIN.)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, color = "black") +
  geom_density()
ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "lm")
ggplot(data = nba.data, aes(x = SEASON, y = WIN.)) +
  geom_violin() +
  geom_boxplot(width = 0.2)
The order of geom functions is important:
# The violin is drawn last, so it covers the box plot
ggplot(data = nba.data, aes(x = SEASON, y = WIN.)) +
  geom_boxplot(width = 0.2) +
  geom_violin()
Facet functions make panel plots easy to create.
facet_wrap wraps a 1d sequence of panels into 2d.
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE)
p + facet_wrap(~SEASON)
facet_grid forms a matrix of panels defined by row and column faceting variables.
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE)
p + facet_grid(REGION ~ SEASON)
The scale functions control how the plot maps data values to the visual values of an aesthetic, for instance,
scale_x_continuous
scale_y_discrete
scale_color_gradient
scale_fill_manual
The format of scale functions is always scale_element1_element2. The first element names the aesthetic (x, y, color, fill, ...), and the second describes the type of scale (continuous, discrete, gradient, manual, ...).
You can also specify axis and legend labels in the scale function.
R color cheat sheet: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
p <- ggplot(data = nba.data) +
  geom_point(aes(x = OFFRTG, y = DEFRTG, color = WIN., shape = REGION))
p + scale_x_continuous(name = "offensive rate", limits = c(97, 116)) +
  scale_y_reverse(name = "defensive rate") +
  scale_color_gradient(name = "winning rate", low = "green", high = "red") +
  scale_shape_discrete(name = "region", labels = c("EAST", "WEST"))
coord_* functions control the transformation of the coordinate system, such as coord_trans(y = "sqrt").
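A brief sketch of a few coord_* functions applied to the same base plot. It assumes ggplot2 is installed and uses mtcars as a stand-in data set; depending on the installed ggplot2 version, coord_trans() may emit a deprecation message.

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Square-root transformation of the y axis, as mentioned in the text
p_sqrt <- p_base + coord_trans(y = "sqrt")

# Zoom in on part of the x range without dropping the other points
p_zoom <- p_base + coord_cartesian(xlim = c(2, 4))

# Swap the x and y axes
p_flip <- p_base + coord_flip()

p_sqrt
p_zoom
p_flip
```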
We can change the theme of a plot using the theme_* functions.
The labs function sets the title, subtitle, and caption of your plot.
The theme function is a powerful way to customize the non-data components of your plots, i.e. titles, labels, fonts, background, grid lines, and legends. See R help for details.
ggsave saves the plot to your local drive.
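The following sketch puts labs, a theme_* function, theme, and ggsave together. It assumes ggplot2 is installed and uses mtcars as a stand-in data set; the titles, object names, and output file name are placeholders.

```r
library(ggplot2)

p_final <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel economy versus weight",
       subtitle = "Built-in mtcars data",
       caption = "Illustrative example") +
  theme_bw() +                                  # a complete theme
  theme(plot.title = element_text(face = "bold"),
        panel.grid.minor = element_blank())     # fine-tune non-data elements

# Save to a temporary file; replace with a real path as needed
out_file <- file.path(tempdir(), "example_plot.pdf")
ggsave(out_file, plot = p_final, width = 6, height = 4)
```

ggsave picks the graphics device from the file extension, so changing the extension changes the output format.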
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
R help is also a great resource.
Wickham, H. (2016). ggplot2: elegant graphics for data analysis. Springer.
library(ggplot2)
library(ggrepel)  # provides geom_text_repel()
ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) +
  geom_point(aes(color = REGION), shape = 1) +
  geom_text(data = subset(nba.data, WIN. > 0.65),
            aes(label = ABV), size = 1.5) +
  geom_text_repel(data = subset(nba.data, WIN. < 0.3),
                  aes(label = ABV), size = 1.5,
                  min.segment.length = 0, box.padding = 0.3) +
  facet_wrap(~SEASON) +
  theme_bw() +
  scale_x_continuous("Offensive Rate") +
  scale_y_reverse("Defensive Rate", limits = c(118, 95)) +
  scale_color_manual("Region", values = c("blue3", "red3")) +
  scale_size_continuous("Winning Rate", breaks = c(0.2, 0.4, 0.6)) +
  theme(legend.position = "bottom",
        panel.grid.minor = element_blank())
gridExtra: A package that helps you arrange multiple plots on a page.
GGally: An extension that reduces the complexity of combining geometric objects with transformed data.
ggExtra: A package that adds marginal density plots or histograms to ggplot2 scatter plots.
ggrepel: A convenient companion to geom_text() that keeps text labels from overlapping.
gganimate: A grammar of animated graphics.
More information: https://exts.ggplot2.tidyverse.org/gallery/
ggpairs: Make a matrix of plots with a given data set.
ggcorr: Plot a correlation matrix (heatmap) with ggplot2.
library(GGally)
ggpairs(data = nba.data, columns = 3:7)
ggcorr(data = nba.data[, 3:7])
ggMarginal: Create a ggplot2 scatter plot with marginal density plots (default) or histograms, or add the marginal plots to an existing scatter plot.
library(ggExtra)
p <- ggplot(nba.data, aes(x = OFFRTG, y = DEFRTG, color = REGION)) +
  geom_point() +
  theme_bw() +
  theme(legend.position = "bottom")
ggMarginal(p, groupColour = TRUE, groupFill = TRUE)
library(ggplot2)
library(ggrepel)    # provides geom_text_repel()
library(gganimate)  # provides transition_states()
ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) +
  geom_point(aes(color = REGION), shape = 1) +
  geom_text_repel(aes(label = ABV), size = 1.5, box.padding = 0.3) +
  theme_bw() +
  scale_y_reverse(limits = c(120, 97)) +
  scale_color_manual(values = c("blue3", "red3")) +
  # Here comes the gganimate specific bits
  labs(title = 'SEASON: {closest_state}', x = 'OFFRTG', y = 'DEFRTG') +
  theme(title = element_text(size = 5),
        text = element_text(size = 2)) +
  transition_states(SEASON,
                    transition_length = 2,
                    state_length = 1)